Robust L1 orthogonal regression
Authors
Abstract
Assessing the linear relationship between a set of continuous predictors and a continuous response is a well-studied problem in statistics and is applied in many data mining situations. L2-based methods such as ordinary least squares and principal components regression can be used to determine this relationship. However, both of these methods become impaired when multicollinearity is present. This problem is compounded when outliers confound standard multicollinearity diagnostics. This work proposes an L1 orthogonal regression method (L1OR) formulated as a nonconvex optimization problem. Solution strategies for finding globally optimal solutions are presented. A simulation study of the method's robustness to outliers shows that L1OR is superior to ordinary least squares regression (OLS) and principal components regression (PCR), and is competitive with M-regression (M-R), in the presence of outliers. The new method also outperforms OLS, PCR, and M-R on data from an environmental application.

Introduction and Background

Data miners are often posed with the problem of determining the relationship between several predictor variables and a response variable. Standard techniques include ordinary least squares regression (OLS), principal components regression (PCR), and non-parametric and semi-parametric regression. When outliers, or unusual observations, are present in the data, these regression techniques become impaired. Methods such as M-regression (M-R) use M-estimates to reduce the impact of outliers. However, these methods are not designed for developing errors-in-variables models, in which both the predictors and the response have measurement error or are considered random components; an example is pH and alkalinity measured in the field, both of which usually carry measurement error. Orthogonal regression (OR) is used when uncertainty is known to be present in both the independent and dependent variables.
In contrast with OLS, where residuals are measured as the vertical distances of observations to the fitted surface, in OR residuals are measured as the orthogonal distances to the fitted surface.

Previous Work on Robust Orthogonal Regression

The sensitivity of OR to outliers has been noted, and other investigators have worked to develop robust methods (Brown, 1982; Carroll and Gallo, 1982; Zamar, 1989). The work of Zamar (1989) includes the use of S- and M-estimates for OR. Orthogonal regression can be formulated as equivalent to finding the last principal component, i.e., the direction of minimum variation, in principal component analysis (PCA). Hence, any robust PCA method can be used for robust orthogonal regression. The two main approaches for robust PCA are (1) finding robust estimates of the covariance matrix (in traditional PCA, the principal components are eigenvectors of the covariance matrix) and (2) using a robust measure of dispersion. Research in the former area includes (Campbell, 1980; Devlin et al., 1981; Galpin and Hawkins, 1987; Naga, 1990; Marden, 1999; Croux and Haesbroeck, 2000; Kamiya and Eguchi, 2001). The latter approach coincides with the work presented here. Robust measures of dispersion in PCA have been investigated in (Li and Chen, 1985; Xie et al., 1993; Maronna, 2005).

Traditional Orthogonal Regression

Suppose we are given observations with continuous responses (xi, yi) ∈ Rd × R, i = 1, . . . , n. OR seeks an orthogonal projection of the data onto a hyperplane such that the orthogonal distances of the points (xi, yi) to the hyperplane are minimized. We assume throughout this work that the means have been subtracted from the samples, so that the fitted hyperplane passes through the origin. In OR, the sum of squared orthogonal distances of (xi, yi) to the hyperplane defined by bT (x, y) = 0 is minimized. Finding b involves first optimizing the problem
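The equivalence described above, between traditional (L2) orthogonal regression and the last principal component, can be made concrete with a short sketch: center the data, form the covariance matrix, and take its eigenvector of smallest eigenvalue as the normal b of the fitted hyperplane bT(x, y) = 0. This is a minimal NumPy illustration, not the authors' implementation; the function name is ours.

```python
import numpy as np

def orthogonal_regression(X, y):
    """Fit a hyperplane b^T (x, y) = 0 minimizing squared orthogonal distances."""
    Z = np.column_stack([X, y])
    Z = Z - Z.mean(axis=0)              # subtract means so the plane passes through the origin
    cov = Z.T @ Z / len(Z)              # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
    b = eigvecs[:, 0]                   # last principal component: direction of minimum variation
    residuals = Z @ b                   # signed orthogonal distances (||b|| = 1)
    return b, residuals
```

For data lying near a line y = 2x, the recovered normal is proportional to (2, -1), i.e., orthogonal to the line's direction (1, 2), up to sign.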
Similar papers
Fitting Two Concentric Circles and Spheres to Data by l1 Orthogonal Distance Regression
The problem of fitting two concentric circles and spheres to data arises in computational metrology. The most commonly used criterion for this is the least squares norm. There is also interest in other criteria, and here we focus on the use of the l1 norm, which is traditionally regarded as important when the data contain wild points. A common approach to this problem involves an iteration proces...
L1-norm Penalised Orthogonal Forward Regression
An l1-norm penalized orthogonal forward regression (l1-POFR) algorithm is proposed based on the concept of leave-one-out mean square error (LOOMSE). Firstly, a new l1-norm penalized cost function is defined in the constructed orthogonal space, and each orthogonal basis is associated with an individually tunable regularization parameter. Secondly, due to the orthogonal computation, the LOOMSE can be anal...
Least Squares Optimization with L1-Norm Regularization
This project surveys and examines optimization approaches proposed for parameter estimation in Least Squares linear regression models with an L1 penalty on the regression coefficients. We first review linear regression and regularization, and both motivate and formalize this problem. We then give a detailed analysis of 8 of the varied approaches that have been proposed for optimizing this objec...
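As a concrete illustration of the problem class this survey addresses, the following sketch solves L1-penalised least squares by iterative soft-thresholding (ISTA): a gradient step on the squared loss followed by a shrinkage step that sets small coefficients exactly to zero. This is one simple member of the family of approaches such surveys cover, not any specific algorithm from the paper; the function name and parameters are illustrative.

```python
import numpy as np

def lasso_ista(A, b, lam, n_iter=500):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 via iterative soft-thresholding."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)             # gradient of the smooth squared-loss term
        z = x - step * grad                  # gradient descent step
        # soft-thresholding: the proximal operator of the l1 penalty
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return x
```

With A the identity the solution reduces to componentwise soft-thresholding of b, which makes the sparsity-inducing effect of the l1 penalty easy to verify.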
Outlier-Resistant L1 Orthogonal Regression via the Reformulation-Linearization Technique
Assessing the linear relationship between a set of continuous predictors and a continuous response is a well-studied problem in statistics and data mining. L2-based methods such as ordinary least squares and orthogonal regression can be used to determine this relationship. However, both of these methods become impaired when influential values are present. This problem becomes compounded when ou...
A robust autoregressive gaussian process motion model using l1-norm based low-rank kernel matrix approximation
This paper considers the problem of modeling complex motions of pedestrians in a crowded environment. A number of methods have been proposed to predict the motion of a pedestrian or an object. However, it is still difficult to make a good prediction due to challenges, such as the complexity of pedestrian motions and outliers in a training set. This paper addresses these issues by proposing a ro...